Abstract

F. Giroire has recently proposed an algorithm which returns the approximate number of distinct
elements in a large sequence of words, under strong constraints coming from the analysis of large
databases. His estimation is based on statistical properties of uniform random variables in [0, 1]. Here we
propose an optimal estimation, using information and estimation theory.

1 Introduction 

The aim of this note is to improve a solution proposed by Giroire [Gir05] to the following problem: consider
a sequence $Y = (Y_1, \dots, Y_N)$ of words (one may think of a sequence of files on a disk, a list of requests, a
novel by Shakespeare, etc.); we make no assumption on the structure of $Y$, and we want to know
the number (usually denoted $F_0$ in the database community) of distinct elements of this sequence. The
motivation comes from the analysis of large data sets, and especially the analysis of internet traffic: certain attacks may
be detected at router level, because they generate an unusual number of distinct connections (see [Fla04]).
Most algorithms use a dictionary to store every word, so that the memory needed is linear in $F_0$. Here
the size of the data sets is huge, making it impossible to store every word, so the algorithm should satisfy
the two following constraints: it should use constant memory and make only one pass over the data. These
constraints are very strong, but on the other hand we allow the algorithm to return only an estimate of $F_0$.

The main idea used in [Gir05], introduced by Flajolet and Martin [FM85], is to transform this problem
into a probabilistic one, using hash functions.

A hash function is a function $h : \mathcal{C} \to [0, 1]$, where $\mathcal{C}$ is a finite set of words (say the English
language, $\{0, 1\}^8$, etc.), such that the image of a typical sequence of words behaves as a
sequence of i.i.d. random variables, uniform in [0, 1].

This definition is of course somewhat informal, but we will assume, from now on, that, writing $X_i = h(Y_i)$,
the set $X = \{X_1, \dots, X_N\}$ is the realization of $F_0$ i.i.d. r.v., uniform on [0, 1]. Existence and construction of
good hash functions are discussed in [Knu73].
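
As an illustration of the informal definition above, here is a minimal Python sketch of a map from words to [0, 1). It assumes that the leading bytes of a SHA-256 digest behave like uniform random bits; the helper name `hash_to_unit` and the choice of SHA-256 are ours and are not taken from [Gir05] or [Knu73].

```python
import hashlib

def hash_to_unit(word: str) -> float:
    """Map a word to a float in [0, 1) (hypothetical helper, not from the paper)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    # Keep the top 53 bits of the first 8 bytes: a float has a 53-bit mantissa,
    # so the division below is exact and the result is always < 1.
    value = int.from_bytes(digest[:8], "big") >> 11
    return value / 2**53

words = ["to", "be", "or", "not", "to", "be"]
hashed = [hash_to_unit(w) for w in words]
```

Repeated words are sent to the same value, so the number of distinct hashed values equals the number of distinct words (barring collisions).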

Set $\theta = F_0$ and denote as usual by $X_{(1)}$ the smallest $X_i$, by $X_{(2)}$ the second smallest, and so on. The key point
is that the information on $\theta$ contained in $\{Y_1, \dots, Y_N\}$ is equivalent to that contained in $(X_{(1)}, \dots, X_{(\theta)})$.

As a consequence, we are now dealing with a classical statistical problem: given a (small) sample of
$(X_1, \dots, X_\theta)$, i.i.d. r.v., uniform on [0, 1], we want to estimate the (large) parameter $\theta$. Denote by $M$ the
memory available (the number of real numbers that can be stored). One should determine:

1. A way of extracting an $M$-sample of $X$ (the $M$ smallest, the $M$ with the longest sequence of zeros in
their binary representation, etc.).

2. A function $\xi : [0, 1]^M \to \mathbb{R}$ which approximates $\theta$, when applied to this $M$-sample.

State of the Art. Flajolet and Martin [FM85] used these ideas to construct an algorithm based
on searching for patterns of 0's and 1's in the binary representation of the hashed values $X_1, \dots, X_\theta$. It was
improved by Durand and Flajolet [DF93]. Bar-Yossef et al. [BYJK+02] proposed 3 efficient
algorithms, whose ideas were generalized by Giroire [Gir05].

In a different way, Alon, Matias, and Szegedy consider estimation by the method of moments, making the
implementation proposed in [FM85] easier. For a nice survey of these ideas one may read [Fla04].

Giroire's algorithm. The starting idea in [Gir05] is to use this simple property:

\[ E[X_{(1)}] = \frac{1}{\theta + 1}. \]



Consequently, a naive algorithm would hash every data item, compare it to the smallest hashed value already
seen, and finally return $1/X_{(1)}$. Unfortunately, $E[1/X_{(1)}] = \infty$. However, $1/X_{(2)}, 1/X_{(3)}, \dots$ have finite
expectation. This leads Giroire to propose an algorithm which returns a function of $X_{(k)}$, for some $k$. In order
to improve the precision of such an algorithm, one may wish to execute it $m$ times with $m$ different hash
functions, but this would cost too much time. Therefore Giroire uses stochastic averaging, introduced in
[FM85]: the idea is to simulate $m$ different experiments by dividing [0, 1] into $m$ intervals.
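
The property $E[X_{(1)}] = 1/(\theta + 1)$ can be checked numerically. The following Python sketch is only a sanity check under the assumption that the hashed values are exactly i.i.d. uniform; the function name, the seed and the values of `theta` and `trials` are arbitrary choices of ours.

```python
import random

def mean_of_minimum(theta: int, trials: int, seed: int = 0) -> float:
    """Empirical mean of the minimum of `theta` i.i.d. uniforms on [0, 1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(rng.random() for _ in range(theta))
    return total / trials

theta = 9
estimate = mean_of_minimum(theta, trials=20000)
exact = 1 / (theta + 1)   # the property used in [Gir05]
```

The same experiment with $1/X_{(1)}$ in place of $X_{(1)}$ is not expected to stabilize, since that expectation is infinite.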

Algorithm 1.

let $k, m$ be integers; initialize $(X_{(1),i}, \dots, X_{(k),i};\ i = 1, \dots, m)$ with $X_{(p),i} = i/m$ for all $i, p$.
for $j = 1$ to $N$
    $X_j = h(Y_j)$.
    let $i$ be the integer such that $X_j$ lies in $[\frac{i-1}{m}, \frac{i}{m}[$.
    update the $k$-dimensional vector of the $k$ smallest values $X_{(1),i}, \dots, X_{(k),i}$ lying in $[\frac{i-1}{m}, \frac{i}{m}[$.
next $j$.
for all $p, i$, renormalize $X_{(p),i} = m\,(X_{(p),i} - \frac{i-1}{m})$.
return an estimator $\xi = \xi(X_{(l),i};\ i = 1, \dots, m;\ l = 1, \dots, k)$.

Thus we get $m$ vectors in $\mathbb{R}^k$. $X_{(k),i}$ is the $k$-th smallest hashed value lying in $[\frac{i-1}{m}, \frac{i}{m}[$, renormalized to
get a real in [0, 1]. If fewer than $l$ values have fallen in the $i$-th interval, then $X_{(l),i} = 1$. Obviously, Algorithm 1
makes only one pass over each data item $Y_j$. The memory used by the algorithm is indeed $M$, if we have chosen
$k \cdot m = M$. The estimation returned by the algorithm does not depend on any assumption on the repetitions
in the sequence $X_1, \dots, X_N$.
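
The one-pass bookkeeping of Algorithm 1 can be sketched in Python as follows. This is an illustration rather than the implementation of [Gir05]: the function name is hypothetical, intervals are indexed from 0 instead of 1, the renormalization is applied on the fly instead of at the end (which gives the same values), and we assume that a repeated hashed value must simply be ignored.

```python
import bisect

def k_smallest_per_interval(hashed, k, m):
    """Return m sorted lists of length k.

    Entry [i][p] is the (p+1)-th smallest distinct hashed value seen in
    [i/m, (i+1)/m), renormalized to [0, 1]; empty slots keep the value 1.0.
    """
    buckets = [[1.0] * k for _ in range(m)]
    for x in hashed:
        i = min(int(x * m), m - 1)   # index of the interval containing x
        u = x * m - i                # renormalized position inside interval i
        slots = buckets[i]
        if u >= slots[-1] or u in slots:
            continue                 # too large to matter, or a repeated word
        bisect.insort(slots, u)      # insert while keeping the list sorted
        slots.pop()                  # drop the largest: k values remain
    return buckets

stream = [0.125, 0.375, 0.125, 0.25, 0.625, 0.875, 0.375]
result = k_smallest_per_interval(stream, k=2, m=2)
```

The memory used is $k \cdot m$ floats, whatever the length of the stream, and the repetitions of 0.125 and 0.375 leave the result unchanged.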

Giroire [Gir05] proposes 3 estimators $\xi_1, \xi_2, \xi_3$, using the inverse function, the square root function and the
logarithm respectively. For example,

\[ \xi_3 := \left( \frac{\Gamma(k - 1/m)}{\Gamma(k)} \right)^{-m} e^{-\frac{1}{m} \sum_{i=1}^{m} \log X_{(k),i}}. \]

For each $k, m$ these estimators are asymptotically unbiased, i.e. $E[\xi_i] \sim \theta$ when $\theta$ goes to $\infty$. Their variances,
relative to $\theta^2$, are all about $1/km$. Here we give a fourth estimator, which is also asymptotically unbiased:

\[ \hat\theta := \frac{km - 1}{\sum_{i=1}^{m} X_{(k),i}}. \]



Plan. Using information and estimation theories, we first show that the estimator $\hat\theta$ is optimal under a
simplified model, which we call the independent model. Then we discuss its actual optimality.



2 The best estimation under the independent model

In this section, $\Rightarrow$ denotes convergence in law. Recall that a real-valued random variable is said to follow
the Gamma law with parameters $(k, \theta)$ if

\[ P(X \in [t, t + dt[) = \frac{t^{k-1}}{\Gamma(k)}\, \theta^k e^{-\theta t}\, 1_{t > 0}\, dt. \]

The asymptotic behavior of the minimum $X_{(1)}$ of $\theta$ random uniform variables in [0, 1] is well-known (see for
example [Fel70]): $\theta X_{(1)} \Rightarrow \gamma_1$ as $\theta \to \infty$, where $\gamma_1$ follows the Gamma$(1, 1)$ law; in other words, $X_{(1)}$ is
approximately Gamma$(1, \theta)$-distributed. More generally, one can prove here the following convergence

\[ (\theta X_{(k),1}, \dots, \theta X_{(k),m}) \Rightarrow (\gamma_1, \dots, \gamma_m) \quad (\theta \to \infty), \tag{1} \]

where the $\gamma_i$ are i.i.d. r.v. of law Gamma$(k, 1)$. Consequently, we assume in this section that the $X_{(k),i}$ are
i.i.d. r.v. of law Gamma$(k, \theta)$; this is the so-called independent model. We set

\[ \hat\theta = \frac{km - 1}{\sum_{i=1}^{m} X_{(k),i}}. \]

Remark 1. This estimator depends only on the $m$ values $(X_{(k),i},\ i = 1, \dots, m)$, not on the $km - m$ other
hashed values stored by the algorithm. This follows from the fact that the knowledge of these values does not
provide any information about $\theta$. For a given $i$, conditionally on $X_{(k),i}$, the r.v. $(X_{(1),i}, \dots, X_{(k-1),i})$ are
distributed uniformly on $[0, X_{(k),i}]$.

A simple calculation shows that under the independent model,

\[ E[\hat\theta] = \theta, \qquad \mathrm{Var}(\hat\theta) = \frac{\theta^2}{km - 2}. \]
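
The calculation can be sketched as follows (our own reconstruction, using only the standard moments of the Gamma law and assuming $km > 2$). Under the independent model, $S = \sum_{i=1}^{m} X_{(k),i}$ follows the Gamma$(km, \theta)$ law, hence:

```latex
\begin{align*}
E\!\left[\frac{1}{S}\right]
  &= \int_0^\infty \frac{1}{s}\,\frac{s^{km-1}}{\Gamma(km)}\,\theta^{km} e^{-\theta s}\,ds
   = \theta\,\frac{\Gamma(km-1)}{\Gamma(km)} = \frac{\theta}{km-1}, \\
E\!\left[\frac{1}{S^2}\right]
  &= \theta^2\,\frac{\Gamma(km-2)}{\Gamma(km)} = \frac{\theta^2}{(km-1)(km-2)}, \\
E[\hat\theta] &= (km-1)\,E\!\left[\frac{1}{S}\right] = \theta, \\
\mathrm{Var}(\hat\theta)
  &= (km-1)^2\,E\!\left[\frac{1}{S^2}\right] - \theta^2
   = \theta^2\,\frac{km-1}{km-2} - \theta^2 = \frac{\theta^2}{km-2}.
\end{align*}
```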

This is indeed better than the 3 estimators proposed in [Gir05].
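
The mean and variance of the estimator $(km - 1)/\sum_i X_{(k),i}$ under the independent model can also be checked by simulation. The Python sketch below draws the $X_{(k),i}$ directly from the Gamma$(k, \theta)$ law, so it illustrates the independent model only, not the exact one; the function names, the seed and the numerical values are our own choices.

```python
import random
import statistics

def theta_hat(x_k, k):
    """Estimator (k*m - 1) / (sum of the m values X_(k),i)."""
    m = len(x_k)
    return (k * m - 1) / sum(x_k)

def simulate(theta, k, m, trials, seed=0):
    """Draw `trials` estimates with X_(k),i i.i.d. Gamma(k, theta)."""
    rng = random.Random(seed)
    # random.gammavariate takes a *scale* parameter, hence 1/theta for rate theta.
    return [theta_hat([rng.gammavariate(k, 1 / theta) for _ in range(m)], k)
            for _ in range(trials)]

theta, k, m = 1000.0, 8, 16
estimates = simulate(theta, k, m, trials=2000)
empirical_mean = statistics.fmean(estimates)      # should be close to theta
empirical_var = statistics.variance(estimates)    # should be close to predicted_var
predicted_var = theta**2 / (k * m - 2)
```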

Recall a few definitions from statistics (see for example [Leh83]): given an $m$-sample of i.i.d. random variables
$X_1, \dots, X_m$ of some law $P_\theta$, any random variable $S = S(X_1, \dots, X_m)$ is called a statistic. Here we consider
the statistic $S = \sum_{i=1}^{m} X_{(k),i}$.

Definition 1. A statistic $S$ is sufficient for the parameter $\theta$ if and only if, conditionally on $S$, the law of
$(X_1, \dots, X_m)$ does not depend on $\theta$.

More informally, $S$ is sufficient if, given $S$, the knowledge of $(X_1, \dots, X_m)$ does not give any further information
on $\theta$.

Definition 2. A statistic $S$ is complete if whenever $h(S)$ is a function of $S$ for which $E[h(S)] = 0$ for all $\theta$,
then $h = 0$, $P_\theta$-almost everywhere.

Here, the statistic $S = \sum_{i=1}^{m} X_{(k),i}$ is complete and sufficient. Some simple criteria to check sufficiency
and completeness are given in [Leh83]. Complete sufficient statistics share the following useful property:

Theorem 1 (Lehmann-Scheffé). Let $S$ be a sufficient and complete statistic. Let $\xi^*$ be another unbiased (i.e.
$E[\xi^*] = \theta$) estimator of $\theta$. Among all the unbiased estimators of $\theta$, $E[\xi^* \mid S]$ has minimal variance. Such an
estimator is said to be efficient.

Corollary 1. Let $\xi$ be another unbiased estimator of $\theta$. Under the independent model,

\[ E[(\xi - \theta)^2] \ge E[(\hat\theta - \theta)^2]. \]

Remark 2. Note that $\mathrm{Var}(\hat\theta)$ is about $\theta^2/(km)$. This optimal bound does not depend on the algorithm, see [PD03].



3 Optimality in the real model 

From now on, we place ourselves in the exact model: $X_{(p),i}$ is the $p$-th smallest realization of $\theta$ i.i.d. r.v. uniform
on [0, 1], among the values lying in $[\frac{i-1}{m}, \frac{i}{m}]$. When $i \ne j$, there is now dependency between $X_{(k),i}$ and $X_{(k),j}$.
Let $P$ and $P_{ind}$ denote the laws corresponding respectively to the exact and independent models, and $E$ and $E_{ind}$ the
corresponding expectations.

Lemma 1. Let $A$ be the event

\[ A = A_{k,m,\theta} : \text{"for all } i = 1, \dots, m,\ X_{(k),i} < \dots\text{"} \]

(i.e. at least $k$ hashed values have fallen in each of the $m$ intervals). From a classical inequality (see [Bol85])
we get

\[ P(A) \ge 1 - 2m\, e^{-\dots}. \]

Here is the main result: 

Theorem 2 (Optimality in the exact model). Let $\xi(X)$, with $X = (X_1, \dots, X_m)$, be another estimator of $\theta$.
Let $b(\theta)$ be the bias $E_\theta[\xi - \theta]$. We assume

1. $b(\theta) = o(\sqrt{\theta})$.

2. There exists a constant $C$ such that $|\xi(x)| \le C / \|x\|$, for every $x$ in $\mathbb{R}^m$, $\|x\|$ large enough.

Then

\[ E_\theta[(\xi - \theta)^2] \ge E_\theta[(\hat\theta - \theta)^2] + O(\theta). \]

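Proof. First write

\[ E[(\xi - \theta)^2] = E[(\xi - \theta)^2 1_A] + E[(\xi - \theta)^2 (1 - 1_A)] = E[(\xi - \theta)^2 1_A] + o(\theta), \]

using Lemma 1. We now return to the independent model:

\[ E[(\xi - \theta)^2] = E_{ind,\theta}[(\xi - \theta)^2] + \left( E[(\xi - \theta)^2 1_A] - E_{ind,\theta}[(\xi - \theta)^2] \right) + o(\theta). \tag{2} \]

The key point is that, conditioned on the event $A$, the r.v. $(X_{(k),1}, \dots, X_{(k),m})$ admit a density with respect to
the Lebesgue measure on $\mathbb{R}_+^m$:

\[
\begin{aligned}
E[(\xi - \theta)^2 1_A] - E_{ind,\theta}[(\xi - \theta)^2]
 ={}& \int_{[0, \frac{1}{m}]^m} |\xi(x_1, \dots, x_m) - \theta|^2\,
      \frac{\theta(\theta - 1) \cdots (\theta - km + 1)}{(k-1)! \cdots (k-1)!}\,
      x_1^{k-1} \cdots x_m^{k-1}\, (1 - x_1 - \dots - x_m)^{\theta - km}\, dx_1 \dots dx_m \\
 &- \int_{\mathbb{R}_+^m} |\xi(x_1, \dots, x_m) - \theta|^2\,
      \frac{x_1^{k-1}}{\Gamma(k)} \cdots \frac{x_m^{k-1}}{\Gamma(k)}\,
      \theta^{km} e^{-\theta(x_1 + \dots + x_m)}\, dx_1 \dots dx_m.
\end{aligned}
\]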

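We omit the proof of the following lemma:

Lemma 2.

\[
\left| \frac{\theta(\theta - 1) \cdots (\theta - km + 1)}{\theta^{km}}\, (1 - x_1 - \dots - x_m)^{\theta - km} - e^{-\theta(x_1 + \dots + x_m)} \right|
\le c^{ste}\, \theta\, (x_1 + \dots + x_m)^2\, e^{-\theta(x_1 + \dots + x_m)},
\]

where the constant depends neither on $\theta$ nor on the $x_i$'s.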
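Hence

\[
\left| E[(\xi - \theta)^2 1_A] - E_{ind,\theta}[(\xi - \theta)^2] \right|
\le \int_{[0, \frac{1}{m}]^m} |\xi(x_1, \dots, x_m) - \theta|^2\,
    \frac{x_1^{k-1}}{\Gamma(k)} \cdots \frac{x_m^{k-1}}{\Gamma(k)}\, \theta^{km}
    \left\{ c^{ste}\, \theta\, (x_1 + \dots + x_m)^2 e^{-\theta(x_1 + \dots + x_m)} \right\} dx_1 \dots dx_m + O(\theta).
\]

Set $y_i = \theta x_i$, $i = 1, \dots, m$, in the integrand:

\[
\le \int_{[0, \frac{\theta}{m}]^m} \left| \xi\!\left( \tfrac{y_1}{\theta}, \dots, \tfrac{y_m}{\theta} \right) - \theta \right|^2
    \frac{y_1^{k-1}}{\Gamma(k)} \cdots \frac{y_m^{k-1}}{\Gamma(k)}\, \theta^{-1}
    \left\{ c^{ste}\, (y_1 + \dots + y_m)^2 e^{-(y_1 + \dots + y_m)} \right\} dy_1 \dots dy_m + O(\theta).
\]

Here we use the hypothesis made on the estimator: $\xi(x) \le C / \|x\|$. We also need the following
arithmetico-geometric inequality:

\[ (a_1 \cdots a_m)^{\alpha} \le (a_1^2 + \dots + a_m^2)^{m\alpha/2}. \]

Setting $y = (y_1, \dots, y_m)$, one gets

\[
\begin{aligned}
&\le c^{ste} \int_{[0, \frac{\theta}{m}]^m} \frac{\theta^2}{\|y\|^2}\, \|y\|^{m(k-1)}\, \theta^{-1}\, (y_1 + \dots + y_m)^2 e^{-(y_1 + \dots + y_m)}\, dy_1 \dots dy_m + O(\theta) \\
&\le c^{ste} \int_{[0, \frac{\theta}{m}]^m} \frac{\theta^2}{\|y\|^2}\, \|y\|^{m(k-1)}\, \theta^{-1}\, \|y\|^2 e^{-\|y\|}\, dy_1 \dots dy_m + O(\theta).
\end{aligned}
\]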

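Here we make a "polar-like" change of variables in $\mathbb{R}^m$. We get this inequality:

\[
\left| E[(\xi - \theta)^2 1_A] - E_{ind,\theta}[(\xi - \theta)^2] \right|
\le c^{ste}\, \theta \int_0^{\infty} r^{a} e^{-r}\, dr + O(\theta) = O(\theta), \qquad a > 0.
\]

Equation (2) has become

\[
\begin{aligned}
E[(\xi - \theta)^2] &= E_{ind,\theta}[(\xi - \theta)^2] + O(\theta) \\
 &= E_{ind,\theta}[(\xi - b(\theta) - \theta)^2] + b^2(\theta) + O(\theta) \\
 &\ge E_{ind,\theta}[(\hat\theta - \theta)^2] + O(\theta).
\end{aligned}
\]

□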



References 

[Bol85] B. Bollobas. Random graphs. Academic Press, 1985. 

[BYJK+02] Z. Bar-Yossef, T. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements 
in a data stream, 2002. 

[DF93] Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities. 11th Annual
European Symposium on Algorithms (ESA03), 2003.

[Fel70] William Feller. An Introduction to Probability Theory and its Applications, Vol. I. John Wiley 
& sons, 1970. 

[Fla04] Philippe Flajolet. Counting by coin tossings. ASIAN'04, pages 1-12, 2004. 

[FM85] Philippe Flajolet and Nigel Martin. Probabilistic counting algorithms for data base applications. 
Journal of Computer and System Sciences, vol 31(2), pages 182-209, 1985. 

[Gir05] Frederic Giroire. Order statistics and estimating cardinalities of massive data sets. DMTCS 
proceedings, International Conference on Analysis of Algorithms:157-166, 2005. 

[Knu73] Donald Knuth. The Art of Computer Programming, vol. 3 : Sorting and Searching. Addison- 
Wesley, 1973. 

[Leh83] E.L. Lehmann. Theory of Point Estimation. John Wiley & sons, 1983. 

[PD03] P. Indyk and D. Woodruff. Tight lower bounds for the distinct elements problem. Proceedings of
the 44th IEEE Symposium on Foundations of Computer Science, Boston, pages 283-290, 2003. 



Institut Elie Cartan Nancy (IECN) 
Universite Henri Poincare Nancy I 
BP 239 - 54506 Vandoeuvre-Les-Nancy Cedex - France 



{chassain, gerin}@iecn.u-nancy.fr






